knitr::opts_chunk$set(fig.width = 9, fig.height = 6, fig.path = 'Figs/', fig.align = 'center', tidy = TRUE, echo = FALSE, warning = FALSE, message = FALSE)

The variable descriptions for the data are available at https://docs.google.com/spreadsheets/d/1gDyi_L4UvIrLTEC6Wri5nbaMmkGmLQBk-Yx3z0XDEtI/edit#gid=0

Since the dataset has 81 variables, we’ll subset it to the key ones. This helps us understand the variables and the relationships between them, and answer relevant questions. We’ll choose
LoanOriginalAmount, BorrowerRate, LoanStatus, IncomeRange, BorrowerState, ListingCategory, EmploymentStatus, Occupation, CurrentCreditLines, ListingKey, LoanKey, Term and OnTimeProsperPayments.

Subsetting the data
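The subsetting step can be sketched roughly as follows. This is a minimal illustration in base R, not the code behind this report: `loans` and its values are made up, `CreditGrade` only stands in for any column that is dropped, and just four of the thirteen selected variables are shown. The name `df_s` is the one that appears in the test output later on.

```r
# Toy stand-in for the full Prosper data frame (hypothetical values).
loans <- data.frame(
  ListingKey = c("a1", "b2", "c3"),
  LoanOriginalAmount = c(4000, 10000, 15000),
  BorrowerRate = c(0.158, 0.092, 0.275),
  Term = c(36, 36, 60),
  CreditGrade = c("C", "A", "HR")  # stands in for a column we do not keep
)

# Names of the variables to keep (a short subset of the 13 listed above).
keep <- c("ListingKey", "LoanOriginalAmount", "BorrowerRate", "Term")

# Guard against typos in the column names before subsetting.
stopifnot(all(keep %in% names(loans)))

# Keep all rows, but only the selected columns.
df_s <- loans[, keep]
dim(df_s)
```

Checking the names first makes a misspelled variable fail with a clear message instead of a less obvious "undefined columns selected" error.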

Univariate Analysis

First, to get started with our data and build a solid understanding of it, we’ll produce summaries and visualizations of each individual variable.

## [1] 113937     81

Our dataset consists of 81 variables and 113,937 observations; we reduced the number of variables to 13.

We notice three peaks: about 15,000 borrowers took a loan of 4,000 dollars, about 12,000 Prosper customers borrowed 10,000 dollars, and about 12,500 customers have an original loan amount of 15,000 dollars. The three peaks may be explained by differences in monthly income and needs.
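A histogram shows peaks like these visually; a frequency table gives the exact counts behind them. The sketch below uses a made-up vector `amounts` in place of the LoanOriginalAmount column, so the counts are illustrative only.

```r
# Hypothetical loan amounts; the real column is LoanOriginalAmount.
amounts <- c(4000, 4000, 4000, 10000, 10000,
             15000, 15000, 15000, 15000, 2500)

# Count how often each distinct amount occurs, most frequent first.
counts <- sort(table(amounts), decreasing = TRUE)

# The three most common amounts, converted back to numbers.
top3 <- as.numeric(names(counts)[1:3])
top3
```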

We notice that more than 5,000 borrowers have made just 10 on-time payments. The distribution is right-skewed, with a long tail toward higher payment counts.

More than 16,000 borrowers had 10 credit lines at the time the credit profile was pulled.

Most borrowers are from populous states where the cost of living is relatively high: California, New York, Texas, Florida…

Factor w/ 21 levels "0","1","2","3",..: 1 3 1 17 3 2 2 3 8 8 ...

We converted ListingCategory to a factor variable so that we can facet by it.

Most loans requested are for debt consolidation.

Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...

We notice the presence of outliers, especially for the $1-24,999 class. Borrowers with high incomes have the greatest IQR and median.

It would be interesting to see the occupations of borrowers with high original loan amounts, but Occupation has so many levels that we can’t visualize it cleanly.

Bivariate Analysis

A log scale will be more informative, since the relationship between x and y is multiplicative.

Besides overplotting, we see that the same original loan amount comes with a wide range of borrower rates.

Pearson's product-moment correlation

data:  df_s$CurrentCreditLines and df_s$OnTimeProsperPayments
t = 9.5808, df = 22083, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.05119330 0.07746164
sample estimates:
       cor
0.06433861

We expected a negative correlation between current credit lines and on-time Prosper payments, but it turns out to be weakly positive (about 0.06). The distribution seems to be somewhat normal, and we also notice some outliers; for example, someone has about 60 credit lines but fewer than 50 on-time payments.
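A correlation of 0.06 corresponds to an r² of about 0.004, so the very small p-value mostly reflects the large sample rather than a strong relationship. The pieces of a Pearson test can be read off the result of `cor.test` as sketched below. The two vectors are synthetic stand-ins, not the Prosper columns, so the numbers will not match the output above.

```r
set.seed(1)
# Synthetic stand-ins for two numeric columns (not the Prosper data).
credit_lines <- rpois(200, lambda = 10)
on_time <- rpois(200, lambda = 20)

# Pearson's product-moment correlation test.
res <- cor.test(credit_lines, on_time, method = "pearson")

res$estimate    # sample correlation coefficient r
res$estimate^2  # share of variance explained by a linear fit
res$conf.int    # 95 percent confidence interval for r
res$p.value     # p-value for H0: true correlation is 0
```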

Most on-time Prosper payments concern debt consolidation and home improvement, followed by household expenses and large purchases.

Borrowers who choose a 36-month loan (the middle term) have the highest borrower rate.

Factor w/ 3 levels "12","36","60": 2 2 2 2 2 3 2 2 2 2 ...

The median and IQR are greater for the first term (12 months)

Borrowers with an income between 25,000 and 74,999 dollars who spread their loan over 36 months are more likely to pay on time.

Next we look at the mean original loan amount by income range. Before that, we need to order the IncomeRange factor variable.
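Ordering the factor and taking the mean per income range could look like the sketch below. The rows are made up, and apart from "$1-24,999", which appears in the str() output earlier, the level names are assumed to match the labels in the data. Only three income ranges are used to keep the example short.

```r
# Hypothetical rows; level names are assumed to follow the IncomeRange labels.
df <- data.frame(
  IncomeRange = c("$25,000-49,999", "$1-24,999", "$100,000+",
                  "$1-24,999", "$100,000+", "$25,000-49,999"),
  LoanOriginalAmount = c(6000, 3000, 15000, 5000, 13000, 8000)
)

# State the order explicitly; the default alphabetical order would put
# "$100,000+" before "$25,000-49,999".
income_levels <- c("$1-24,999", "$25,000-49,999", "$100,000+")
df$IncomeRange <- factor(df$IncomeRange, levels = income_levels,
                         ordered = TRUE)

# Mean loan amount per income range, returned in level order.
mean_loan <- tapply(df$LoanOriginalAmount, df$IncomeRange, mean)
mean_loan
```

Setting the levels by hand also fixes the order of the bars or facets in any plot that uses the factor.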

Except for the income range between $0 and $24,999, the original loan amount increases as we move up the income ranges.

We just discovered an interesting relationship between borrower rate and current credit lines, so let’s draw that scatter plot separately.

Pearson's product-moment correlation

data:  df_s$BorrowerRate and df_s$CurrentCreditLines
t = -31.936, df = 106330, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.10342277 -0.09151586
sample estimates:
       cor
-0.0974728

Multivariate Analysis

It may seem strange that borrowers who are not employed borrow, on average, more than borrowers in the 0-24,999$ income range, who earn more. Borrowers whose income is more than 100,000 dollars turn out to have the greatest original loan amount. It would be interesting to discover why they borrow this money, in other words, which listing categories they are interested in. We start by converting ListingCategory..numeric. to a categorical variable.

Pearson's product-moment correlation

data:  df_by_income$CurrentCreditLines and df_by_income$mean_loan
t = 14.887, df = 7311, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1491985 0.1936903
sample estimates:
      cor
0.1715318

Before scaling the y axis to log10, the data seemed widely spread. On the log scale, there appears to be an exponential relation between current credit lines and the mean original loan amount by income range. This is consistent with the correlation coefficient of 0.287, which is still a weak association (the untransformed test above gives 0.172).
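Why a log10 scale can raise a Pearson coefficient is easiest to see on data that is exponential by construction. The sketch below uses synthetic `x` and `y`, not the grouped Prosper data, and only illustrates the mechanism: Pearson's r measures linear association, and taking the log of y turns an exponential trend into a linear one.

```r
set.seed(42)
# Synthetic data with a multiplicative (exponential) trend plus noise.
x <- runif(300, min = 0, max = 30)
y <- 1000 * exp(0.08 * x) * exp(rnorm(300, sd = 0.3))

r_raw <- cor(x, y)         # linear correlation on the original scale
r_log <- cor(x, log10(y))  # linear correlation after log10-scaling y

c(raw = r_raw, log10 = r_log)
```

A higher coefficient after the transform suggests the relation is closer to exponential than to linear, but it does not by itself make a weak association strong.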

Pearson's product-moment correlation

data:  df_by_income$CurrentCreditLines and df_by_income$median_loan
t = 16.348, df = 7311, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1655904 0.2098144
sample estimates:
      cor
0.1877976

The relation between current credit lines and median loan seems to be stronger than the relation with mean loan, since the coefficient is 0.318 rather than 0.287. The untransformed test above shows the same ordering, with 0.188 for the median loan.

Pearson's product-moment correlation

data:  df_by_income$mean_payments and df_by_income$mean_loan
t = 0.5392, df = 856, p-value = 0.5899
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.04856284 0.08525022
sample estimates:
      cor
0.0184262

Pearson's product-moment correlation

data:  df_by_income$CurrentCreditLines and df_by_income$mean_rate
t = -10.151, df = 7311, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.14043404 -0.09523097
sample estimates:
       cor
-0.1178936

Pearson's product-moment correlation

data:  df_by_income$mean_rate and df_by_income$mean_loan
t = -31.095, df = 7316, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3617422 -0.3212647
sample estimates:
       cor
-0.3416619

Where 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7 - Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans.
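The code-to-label mapping above can be attached to the factor itself, so that plots and tables show names instead of numbers. In this sketch the labels are copied from the list above, while `codes` is a short hypothetical vector standing in for ListingCategory..numeric.

```r
# Labels for codes 0 to 20, in the order given in the list above.
category_labels <- c(
  "Not Available", "Debt Consolidation", "Home Improvement", "Business",
  "Personal Loan", "Student Use", "Auto", "Other", "Baby&Adoption", "Boat",
  "Cosmetic Procedure", "Engagement Ring", "Green Loans",
  "Household Expenses", "Large Purchases", "Medical/Dental", "Motorcycle",
  "RV", "Taxes", "Vacation", "Wedding Loans"
)

# Hypothetical numeric codes, standing in for ListingCategory..numeric.
codes <- c(0, 2, 0, 16, 2, 1, 1, 2, 7, 7)

# levels gives the codes, labels gives the matching names (same order).
listing_category <- factor(codes, levels = 0:20, labels = category_labels)

as.character(listing_category[1:4])

# Counts per category, leaving out categories absent from this toy vector.
table(droplevels(listing_category))
```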

We notice a negative correlation between mean rate and mean loan, especially for debt consolidation and home improvement: as the mean loan amount increases, the rate decreases. We could be interested in calculating by how many dollars the loan amount increases. To do that we’ll create a new variable, called increase:

Pearson's product-moment correlation

data:  df_s$BorrowerRate and df_s$increase
t = -147.4, df = 113940, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.4050664 -0.3953132
sample estimates:
       cor
-0.4002011

Final Plots

Three loan amounts are the most frequent: 4,000, 10,000 and 15,000 dollars. Most borrowers are from populous states where the cost of living is relatively high: California, New York, Texas, Florida… Most loans requested are for debt consolidation.

We expected a negative correlation between current credit lines and on-time Prosper payments, but it turns out to be weakly positive (about 0.06). The distribution seems to be somewhat normal, and we also notice some outliers; for example, someone has about 60 credit lines but fewer than 50 on-time payments. The same original loan amount comes with a wide range of borrower rates. Most on-time Prosper payments concern debt consolidation and home improvement, followed by household expenses and large purchases. Borrowers who choose a 36-month loan (the middle term) have the highest borrower rate. Except for the income range between $0 and $24,999, the original loan amount increases as we move up the income ranges.

Pearson's product-moment correlation

data:  df_by_income$CurrentCreditLines and df_by_income$mean_loan
t = 14.887, df = 7311, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1491985 0.1936903
sample estimates:
      cor
0.1715318

Before scaling the y axis to log10, the data seemed widely spread. On the log scale, there appears to be an exponential relation between current credit lines and the mean original loan amount for each income range. This is consistent with the correlation coefficient of 0.287, which is still a weak association (the untransformed test above gives 0.172).

It may seem strange that borrowers who are not employed borrow, on average, more than borrowers in the 0-24,999$ income range, who earn more. Borrowers whose income is more than 100,000 dollars turn out to have the greatest original loan amount. It would be interesting to discover why they borrow this money, in other words, which listing categories they are interested in. The relation between current credit lines and median loan seems to be stronger than the relation with mean loan, since the coefficient is 0.318 rather than 0.287. We notice a negative correlation between mean rate and mean loan, especially for debt consolidation and home improvement: as the mean loan amount increases, the rate decreases. We could be interested in calculating by how many dollars the loan amount increases.

Reflections

Overall the course was extensive. I started from zero and now I can explore a data set using R. It’s really wonderful, thanks Udacity.

I wanted to challenge myself by choosing such a complex data set. The first difficulty I encountered was the high number of features. I read the variable descriptions and picked the ones that seemed most relevant. Exploring about 10% of the variables let me understand a large part of the data and draw important conclusions; taking a course in feature selection might help me do better. I think I succeeded in interpreting some key plots, discovering the main trends and digesting a large amount of information.

As future work, it would be relevant to build a model to predict the borrower rate from the original loan amount, term and payments. We could even build a model to predict the number of borrowers and which listing category they will request.

It’s also worth exploring new variables.